{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# User's Guide, Chapter 11: Corpus Searching\n",
    "\n",
    "One of music21's important features is its capability to help users examine\n",
    "large bodies of musical works, or *corpora*.  \n",
    "\n",
    "Music21 comes with a substantial corpus called the *core* corpus. When you\n",
    "download music21 you can immediately start working with the files in the \n",
    "corpus directory, including the complete chorales of Bach, many Haydn and\n",
    "Beethoven string quartets, three books of madrigals by Monteverdi, thousands\n",
    "of folk songs from the Essen and various ABC databases, and many more.\n",
    "\n",
    "To load a file from the corpus, simply call *corpus.parse* and assign that\n",
    "file to a variable:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from music21 import *\n",
    "bach = corpus.parse('bach/bwv66.6')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "sphinx_links": {
     "any": true
    }
   },
   "source": [
    "The `music21` core corpus comes with many thousands of works.  All of them (or at least all the collections) are listed on the :ref:`Corpus Reference <referenceCorpus>`.\n",
    "\n",
    "Users can also build their own corpora to index and quickly search their own collections on disk including\n",
    "multiple local corpora, for different projects, that can be accessed individually.\n",
    "\n",
    "This user's guide will cover more about the corpus's basic features.  This\n",
    "chapter focuses on music21's tools for extracting useful metadata - titles,\n",
    "locations, composers names, the key signatures used in each piece, total\n",
    "durations, ambitus (range) and so forth.\n",
    "\n",
    "This metadata is collected in *metadata bundles* for each corpus. The *corpus*\n",
    "module has tools to search these bundles and persist them disk for later\n",
    "research.\n",
    "\n",
    "## Types of corpora\n",
    "\n",
    "Music21 works with three categories of *corpora*, made explicit via the\n",
    "``corpus.Corpus`` abstract class.\n",
    "\n",
    "The first category is the *core* corpus, a large collection of musical works\n",
    "packaged with most music21 installations, including many works from the common\n",
    "practice era, and inumerable folk songs, in a variety of formats:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3189"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "coreCorpus = corpus.corpora.CoreCorpus()\n",
    "len(coreCorpus.getPaths())"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "..  note::\n",
    "\n",
    "    If you've installed a \"no corpus\" version of music21, you can still access\n",
    "    the *core* corpus with a little work.  Download the *core* corpus from\n",
    "    music21's website, and install it on your system somewhere. Then, teach\n",
    "    music21 where you installed it like this:    \n",
    "\n",
    "    >>> coreCorpus = corpus.corpora.CoreCorpus()\n",
    "    >>> #_DOCS_SHOW coreCorpus.manualCoreCorpusPath = 'path/to/core/corpus'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Local Corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Music21` also can have one or more *local* corpora--bodies of works provided and\n",
    "configured by individual music21 users for their own research. They will be covered in :ref:`Chapter 53<usersGuide_53_advancedCorpus>`.  Anyone wanting to use them can jump ahead immediately to that chapter, but for now we'll continue with searching in the core corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "localCorpus = corpus.corpora.LocalCorpus()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can add and remove paths from a *local* corpus with the ``addPath()`` and\n",
    "``removePath()`` methods:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('/Users/josiah/Desktop',)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "localCorpus.addPath('~/Desktop')\n",
    "#_DOCS_SHOW localCorpus.directoryPaths\n",
    "('/Users/josiah/Desktop',) #_DOCS_HIDE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Currently, after adding paths to a corpus, you'll need to rebuild the cache."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "#_DOCS_SHOW corpus.cacheMetadata()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We hope that this won't be necessary in the future.\n",
    "\n",
    "To remove a path, use the `removePath()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "tags": [
     "nbval-ignore-output"
    ]
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/cuthbert/git/music21base/music21/corpus/corpora.py: WARNING: local metadata cache: starting processing of paths: 0\n",
      "/Users/cuthbert/git/music21base/music21/corpus/corpora.py: WARNING: cache: filename: /var/folders/mk/qf43gd_s5f30rzzbt7l7l01h0000gn/T/music21/local.p.gz\n",
      "bundles.py: WARNING: MetadataBundle Modification Time: 1598896535.8389168\n",
      "bundles.py: WARNING: Skipped 0 sources already in cache.\n",
      "/Users/cuthbert/git/music21base/music21/corpus/corpora.py: WARNING: cache: writing time: 0.012 md items: 0\n",
      "\n",
      "/Users/cuthbert/git/music21base/music21/corpus/corpora.py: WARNING: cache: filename: /var/folders/mk/qf43gd_s5f30rzzbt7l7l01h0000gn/T/music21/local.p.gz\n"
     ]
    }
   ],
   "source": [
    "localCorpus.removePath('~/Desktop')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, a call to ``corpus.parse`` or ``corpus.search`` will look for\n",
    "files in any corpus, core or local."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simple searches of the corpus\n",
    "\n",
    "When you search the corpus, music21 examines each metadata object in the\n",
    "metadata bundle for the whole corpus and attempts to match your search string \n",
    "against the contents of\n",
    "the various search fields saved in that metadata object.  \n",
    "\n",
    "You can use ``corpus.search()`` to search the metadata associated with all\n",
    "known corpora, *core*, *virtual* and even each *local* corpus:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<music21.metadata.bundles.MetadataBundle {2161 entries}>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#_DOCS_SHOW sixEight = corpus.search('6/8')\n",
    "sixEight = corpus.corpora.CoreCorpus().search('6/8') #_DOCS_HIDE\n",
    "sixEight"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To work with all those pieces, you can parse treat the MetadataBundle\n",
    "like a list and call ``.parse()`` on any element:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "tags": [
     "nbval-ignore-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"I'll Touzle your Kurchy.\""
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "myPiece = sixEight[0].parse()\n",
    "myPiece.metadata.title"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will return a ``music21.stream.Score`` object which you can work\n",
    "with like any other stream. Or if you just want to see it, there's a \n",
    "convenience ``.show()`` method you can call directly on a MetadataEntry.\n",
    "\n",
    "You can also search against a single ``Corpus`` instance, like this one\n",
    "which ignores anything in your local corpus:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<music21.metadata.bundles.MetadataBundle {2161 entries}>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus.corpora.CoreCorpus().search('6/8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the result of every metadata search is also a metadata bundle, you can\n",
    "search your search results to do more complex searches.  Remember that \n",
    "`bachBundle` is a collection of all works where the composer is Bach.  Here we\n",
    "will limit to those pieces in 3/4 time:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<music21.metadata.bundles.MetadataBundle {363 entries}>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bachBundle = corpus.search('bach', 'composer')\n",
    "bachBundle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<music21.metadata.bundles.MetadataBundle {40 entries}>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bachBundle.search('3/4')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metadata search fields\n",
    "\n",
    "When you search metadata bundles, you can search either through every search\n",
    "field in every metadata instance, or through a single, specific search field.\n",
    "As we mentioned above, searching for \"bach\" as a composer renders different \n",
    "results from searching for the word \"bach\" in general:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<music21.metadata.bundles.MetadataBundle {363 entries}>"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#_DOCS_SHOW corpus.search('bach', 'composer')\n",
    "corpus.corpora.CoreCorpus().search('bach', 'composer') #_DOCS_HIDE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<music21.metadata.bundles.MetadataBundle {20 entries}>"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#_DOCS_SHOW corpus.search('bach', 'title')\n",
    "corpus.corpora.CoreCorpus().search('bach', 'title') #_DOCS_HIDE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<music21.metadata.bundles.MetadataBundle {564 entries}>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#_DOCS_SHOW corpus.search('bach')\n",
    "corpus.corpora.CoreCorpus().search('bach') #_DOCS_HIDE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So what fields can we actually search through? You can find out like this (in v2, replace `corpus.manager` with `corpus.corpora.Corpus`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "actNumber\n",
      "alternativeTitle\n",
      "ambitus\n",
      "associatedWork\n",
      "collectionDesignation\n",
      "commission\n",
      "composer\n",
      "copyright\n",
      "countryOfComposition\n",
      "date\n",
      "dedication\n",
      "groupTitle\n",
      "keySignatureFirst\n",
      "keySignatures\n",
      "localeOfComposition\n",
      "movementName\n",
      "movementNumber\n",
      "noteCount\n",
      "number\n",
      "numberOfParts\n",
      "opusNumber\n",
      "parentTitle\n",
      "pitchHighest\n",
      "pitchLowest\n",
      "popularTitle\n",
      "quarterLength\n",
      "sceneNumber\n",
      "sourcePath\n",
      "tempoFirst\n",
      "tempos\n",
      "textLanguage\n",
      "textOriginalLanguage\n",
      "timeSignatureFirst\n",
      "timeSignatures\n",
      "title\n",
      "volume\n"
     ]
    }
   ],
   "source": [
    "for field in corpus.manager.listSearchFields():\n",
    "    print(field)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This field will grow in the future now that the development team is seeing\n",
    "how useful this searching method can be! Now that we know what all the search\n",
    "fields are, we can search through some of the more obscure corners of the\n",
    "*core* corpus:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<music21.metadata.bundles.MetadataBundle {27 entries}>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#_DOCS_SHOW corpus.search('taiwan', 'locale')\n",
    "corpus.corpora.CoreCorpus().search('taiwan', 'locale') #_DOCS_HIDE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What if you are not searching for an exact match?  If you're searching for\n",
    "short pieces, you probably don't want to find pieces with exactly 1 note then\n",
    "union that set with pieces with exactly 2 notes, etc.  Or for pieces from the\n",
    "19th century, you won't want to search for 1801, 1802, etc.  What you can do is\n",
    "set up a \"predicate callable\" which is a function (either a full python ``def``\n",
    "statement or a short ``lambda`` function) to filter the results.  Each piece\n",
    "will be checked against your predicate and only those that return true.  Here\n",
    "we'll search for pieces with between 400 and 500 notes, only in the ``core``\n",
    "corpus:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<music21.metadata.bundles.MetadataBundle {212 entries}>"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predicate = lambda x: 400 < x < 500\n",
    "corpus.corpora.CoreCorpus().search(predicate, 'noteCount')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also pass in compiled regular expressions into the search.  In this case we will use a regular expression likely to find Handel and Haydn and perhaps not much else:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<music21.metadata.bundles.MetadataBundle {186 entries}>"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "haydnOrHandel = re.compile(r'ha.d.*', re.IGNORECASE)\n",
    "#_DOCS_SHOW corpus.search(haydnOrHandel)\n",
    "corpus.corpora.CoreCorpus().search(haydnOrHandel) #_DOCS_HIDE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unfortunately this really wasn't a good search, since we also got folk songs with the title of \"Shandy\".  Best to use a '\\*\\^\\*' search to match at the beginning of the word only:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<music21.metadata.bundles.MetadataBundle {15 entries}>"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "haydnOrHandel = re.compile(r'^ha.d.*', re.IGNORECASE)\n",
    "#_DOCS_SHOW corpus.search(haydnOrHandel)\n",
    "corpus.corpora.CoreCorpus().search(haydnOrHandel) #_DOCS_HIDE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We've now gone fairly high level in our searching.  We will return to the lowest level in :ref:`Chapter 12: The Music21Object <usersGuide_12_music21Object>`"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
